
(2017) Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer

Keyword [Attention Map]

Zagoruyko S, Komodakis N. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer[J]. arXiv preprint arXiv:1612.03928, 2016.



1. Overview


1.1. Motivation

  • different observers with different knowledge and goals employ different attentional strategies
  • can a teacher network improve the performance of a student network by providing it with information about where it looks?
    In this paper, the student network is improved by forcing it to mimic the attention maps of a powerful teacher network.
  • activation-based and gradient-based attention maps


1.2. Contribution

  • attention mechanism to transfer knowledge
  • activation-based (performs better and can be combined with knowledge distillation) and gradient-based spatial attention maps

1.3. Related Work

  • Attention Mechanism
    • image captioning
    • VQA
    • weakly-supervised object localization
    • classification
  • Gradient-Based
  • Knowledge Distillation
    • shallow networks have been shown to be able to approximate deeper ones without loss in accuracy
  • Network
    • after a certain depth, the improvements come mostly from the increased capacity (number of parameters) of the networks
    • a wide 16-layer ResNet can learn as well as or better than a very thin 1000-layer one

1.4. Dataset

  • ImageNet. classification, localization
  • COCO. object detection, face recognition, and fine-grained recognition



2. Attention Transfer


2.1. Activation-Based Attention Transfer



  • compute a spatial attention map from a layer's feature maps (a minimal sketch follows this list)
  • first layers. activations are high at low-level gradient points (e.g., edges)
  • middle layers. high for the most discriminative regions (e.g., eyes, wheels)
  • top layers. reflect the full object
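
A minimal PyTorch-style sketch of this step, assuming the (F_sum)^p mapping from 2.1.1 below and the per-map L2 normalization used later in the loss (the function name and exact API here are illustrative, not the authors' released code):

```python
import torch
import torch.nn.functional as F

def attention_map(activations: torch.Tensor, p: int = 2) -> torch.Tensor:
    """Map activations of shape (N, C, H, W) to flattened spatial attention
    maps of shape (N, H*W): sum |A_c|^p over channels, then L2-normalize."""
    a = activations.abs().pow(p).sum(dim=1)   # (N, H, W): per-location channel energy
    a = a.flatten(1)                          # flatten the spatial dimensions
    return F.normalize(a, dim=1)              # unit L2 norm per attention map
```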

2.1.1. Three Mapping Functions




  • stronger networks have peaks in attention where weaker networks do not
  • (F_sum)^p puts more weight (than F_sum) on spatial locations that correspond to the neurons with the highest activations
  • (F_max)^p only considers the maximum activation over channels rather than the sum of all (the three mappings are written out below)
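
The three mapping functions, reconstructed here in LaTeX from the paper's definitions (A is a C x H x W activation tensor, A_i its i-th channel slice, p > 1):

```latex
% Activation-based attention mappings F : R^{C x H x W} -> R^{H x W}
\[
F_{\mathrm{sum}}(A) = \sum_{i=1}^{C} \lvert A_i \rvert, \qquad
F_{\mathrm{sum}}^{p}(A) = \sum_{i=1}^{C} \lvert A_i \rvert^{p}, \qquad
F_{\mathrm{max}}^{p}(A) = \max_{i=1,\dots,C} \lvert A_i \rvert^{p}, \qquad p > 1
\]
```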

2.1.2. Cases of Student and Teacher Networks

  • same depth
  • different depth


2.1.3. Loss Function




  • L(W, x). the standard task loss (cross-entropy)
  • I. the set of indices of student-teacher layer pairs whose attention maps are transferred
  • normalization of the attention maps (each vectorized map is divided by its L2 norm) is important for student training; the full loss is written out below
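
The activation-based transfer loss, reconstructed from the paper (Q_S^j = vec(F(A_S^j)) and Q_T^j = vec(F(A_T^j)) are the vectorized student and teacher attention maps of the j-th pair, beta is a weighting hyper-parameter, and p = 2 is used in the experiments):

```latex
% Activation-based attention transfer loss
\[
\mathcal{L}_{AT} = \mathcal{L}(\mathbf{W}_S, x)
 + \frac{\beta}{2} \sum_{j \in \mathcal{I}}
   \Big\lVert \frac{Q_S^{j}}{\lVert Q_S^{j} \rVert_2}
            - \frac{Q_T^{j}}{\lVert Q_T^{j} \rVert_2} \Big\rVert_p,
\qquad Q_S^{j} = \mathrm{vec}\big(F(A_S^{j})\big), \;
       Q_T^{j} = \mathrm{vec}\big(F(A_T^{j})\big)
\]
```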

Attention transfer can also be combined with knowledge distillation.
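
A hedged PyTorch-style sketch of how the pieces combine: one AT term per chosen layer pair plus, optionally, a Hinton-style distillation term on the logits (the mean-squared form of the L2 term and the temperature T follow common implementations, not necessarily the authors' exact code):

```python
import torch
import torch.nn.functional as F

def at_loss_term(a_student: torch.Tensor, a_teacher: torch.Tensor, p: int = 2) -> torch.Tensor:
    """Distance between normalized student/teacher attention maps for one layer pair."""
    q_s = F.normalize(a_student.abs().pow(p).sum(dim=1).flatten(1), dim=1)
    q_t = F.normalize(a_teacher.abs().pow(p).sum(dim=1).flatten(1), dim=1)
    return (q_s - q_t).pow(2).mean()

def kd_loss_term(logits_student: torch.Tensor, logits_teacher: torch.Tensor,
                 T: float = 4.0) -> torch.Tensor:
    """Knowledge-distillation term on softened logits; add it to the task loss
    and the AT terms to combine the two approaches."""
    return F.kl_div(F.log_softmax(logits_student / T, dim=1),
                    F.softmax(logits_teacher / T, dim=1),
                    reduction="batchmean") * (T * T)
```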

2.2. Gradient-Based Attention Transfer

If small changes at a pixel can have a large effect on the network output, then it is logical to assume that the network is “paying attention” to that pixel.



  • flip-invariant version. the gradient attention map of a horizontally flipped input, flipped back, is encouraged to match that of the original input (formula below)
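
The gradient-based losses, reconstructed from the paper (J_S and J_T are the gradients of the student and teacher losses with respect to the input x; flip denotes horizontal flipping):

```latex
% Gradient attention maps and the transfer loss
\[
J_S = \frac{\partial}{\partial x} \mathcal{L}(\mathbf{W}_S, x), \qquad
J_T = \frac{\partial}{\partial x} \mathcal{L}(\mathbf{W}_T, x), \qquad
\mathcal{L}_{AT}(\mathbf{W}_S, \mathbf{W}_T, x)
 = \mathcal{L}(\mathbf{W}_S, x) + \frac{\beta}{2} \lVert J_S - J_T \rVert_2
\]
% Horizontal-flip-invariant variant (no teacher needed)
\[
\mathcal{L}_{\mathrm{sym}}(\mathbf{W}, x)
 = \mathcal{L}(\mathbf{W}, x)
 + \frac{\beta}{2} \Big\lVert
   \frac{\partial}{\partial x} \mathcal{L}(\mathbf{W}, x)
   - \mathrm{flip}\Big(\frac{\partial}{\partial x} \mathcal{L}(\mathbf{W}, \mathrm{flip}(x))\Big)
 \Big\rVert_2
\]
```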




3. Experiments


3.1. Activation-Based



  • training with transfer losses at all levels works better than with only one transfer loss
  • F_sum performs better than F_max


4. Compared with Knowledge Distillation